COVID 19 Data Report

Please knit this file with HTML as it has 3D plots which can be visualized in HTML Page only.

This report analyzes COVID-19 data from 2020 through 2023. The dataset, compiled by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, documents the impact of COVID-19 across the United States and globally.

Data Source: - https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

GitHub - https://github.com/Cosmic-AJ/Mini-Projects/tree/main/COVID-19%20Data%20Report

Library Imports

library(readr)
library(tidyverse)
library(party)
library(caret)
library(e1071) 
library(usmap)
library(ggplot2)
library(rworldmap)
library(plotly)
library(nnet)
library(usmap)
library(ggnewscale)

Importing the data

The data are sourced directly from publicly available GitHub CSV files maintained by Johns Hopkins University.

url_in = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/refs/heads/master/csse_covid_19_data/csse_covid_19_time_series/'
filesnames = c('time_series_covid19_confirmed_US.csv',
               'time_series_covid19_confirmed_global.csv',
               'time_series_covid19_deaths_US.csv',
               'time_series_covid19_deaths_global.csv',
               'time_series_covid19_recovered_global.csv'
               )
urls = str_c(url_in, filesnames)

us_cases = read_csv(urls[1])
## Rows: 3342 Columns: 1154
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (6): iso2, iso3, Admin2, Province_State, Country_Region, Combined_Key
## dbl (1148): UID, code3, FIPS, Lat, Long_, 1/22/20, 1/23/20, 1/24/20, 1/25/20...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
global_cases = read_csv(urls[2])
## Rows: 289 Columns: 1147
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (2): Province/State, Country/Region
## dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
us_deaths = read_csv(urls[3])
## Rows: 3342 Columns: 1155
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (6): iso2, iso3, Admin2, Province_State, Country_Region, Combined_Key
## dbl (1149): UID, code3, FIPS, Lat, Long_, Population, 1/22/20, 1/23/20, 1/24...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
global_deaths = read_csv(urls[4])
## Rows: 289 Columns: 1147
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (2): Province/State, Country/Region
## dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
global_recovered = read_csv(urls[5])
## Rows: 274 Columns: 1147
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (2): Province/State, Country/Region
## dbl (1145): Lat, Long, 1/22/20, 1/23/20, 1/24/20, 1/25/20, 1/26/20, 1/27/20,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
uid_lookup_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/refs/heads/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv"

uid = read_csv(uid_lookup_url) %>%
  select(-c(Lat, Long_, Combined_Key, code3, iso2, iso3, Admin2))
## Rows: 4321 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): iso2, iso3, FIPS, Admin2, Province_State, Country_Region, Combined_Key
## dbl (5): UID, code3, Lat, Long_, Population
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning and Transformation

Unnecessary columns are removed and data is reshaped to facilitate time series and spatial analyses. Feature engineering is used to derive new summary statistics.

global_cases = global_cases %>%
  pivot_longer(cols = -c('Province/State', 'Country/Region', 'Lat', 'Long'),
               names_to='date',
               values_to='cases') %>%
  select(-c(Lat, Long))

global_deaths = global_deaths %>%
  pivot_longer(cols = -c('Province/State', 'Country/Region', 'Lat', 'Long'),
               names_to='date',
               values_to='deaths') %>%
  select(-c(Lat, Long))

global_recovered = global_recovered %>%
  pivot_longer(cols = -c('Province/State', 'Country/Region', 'Lat', 'Long'),
               names_to='date',
               values_to='recovered') %>%
  select(-c(Lat, Long))

us_cases = us_cases %>%
  pivot_longer(cols=-(UID:Combined_Key),
               names_to='date',
               values_to='cases') %>%
  select(Admin2:cases) %>%
  mutate(date=mdy(date)) %>%
  select(-c(Lat, Long_))

us_deaths = us_deaths %>%
  pivot_longer(cols=-(UID:Population),
               names_to='date',
               values_to='deaths') %>%
  select(Admin2:deaths) %>%
  mutate(date=mdy(date)) %>%
  select(-c(Lat, Long_))

Data tables are joined to combine cases, deaths, and recovered counts.

global = global_cases %>%
  full_join(global_deaths) %>%
  full_join(global_recovered) %>%
  rename(Country_Region='Country/Region',
         Province_State ='Province/State') %>%
  mutate(date=mdy(date))
## Joining with `by = join_by(`Province/State`, `Country/Region`, date)`
## Joining with `by = join_by(`Province/State`, `Country/Region`, date)`
global = global %>%
  filter(cases>0)

us = us_cases %>%
  full_join((us_deaths))
## Joining with `by = join_by(Admin2, Province_State, Country_Region,
## Combined_Key, date)`

Further join and cleanup for global data

global = global %>%
  unite('Combined_Key',
       c(Province_State, Country_Region),
       sep=",",
       na.rm=TRUE,
       remove=FALSE)

global = global %>%
  left_join(uid, by=c("Province_State", "Country_Region")) %>%
  select(-c(UID, FIPS)) %>%
  select(Province_State, Country_Region, date, cases, deaths, recovered, Population, Combined_Key)

Aggregate at the US state and national levels, and compute new daily cases/deaths

us_by_state = us %>% 
  group_by(Province_State, Country_Region, date) %>%
  summarize(cases=sum(cases), deaths=sum(deaths),
            population=sum(Population)) %>%
  mutate(deaths_per_mill=deaths*1000000/population) %>%
  select(Province_State, Country_Region, date, cases, deaths, deaths_per_mill, population) %>%
  ungroup()
## `summarise()` has grouped output by 'Province_State', 'Country_Region'. You can
## override using the `.groups` argument.
us_totals = us_by_state %>% 
  group_by(Country_Region, date) %>%
  summarize(cases=sum(cases), deaths=sum(deaths),
            population=sum(population)) %>%
  mutate(deaths_per_mill=deaths*1000000/population) %>%
  select(Country_Region, date, cases, deaths, deaths_per_mill, population) %>%
  ungroup()
## `summarise()` has grouped output by 'Country_Region'. You can override using
## the `.groups` argument.
max(us_by_state$date)
## [1] "2023-03-09"
min(us_by_state$date)
## [1] "2020-01-22"
us_by_state =  us_by_state %>% 
  mutate(new_cases = cases-lag(cases),
         new_deaths=deaths-lag(deaths))

us_totals =  us_totals %>% 
  mutate(new_cases = cases-lag(cases),
         new_deaths= deaths-lag(deaths))

tail(us_totals %>% select(new_cases, new_deaths, everything()))
## # A tibble: 6 × 8
##   new_cases new_deaths Country_Region date          cases deaths deaths_per_mill
##       <dbl>      <dbl> <chr>          <date>        <dbl>  <dbl>           <dbl>
## 1      2147          7 US             2023-03-04   1.04e8 1.12e6           3371.
## 2     -3862        -38 US             2023-03-05   1.04e8 1.12e6           3371.
## 3      8564         47 US             2023-03-06   1.04e8 1.12e6           3371.
## 4     35371        335 US             2023-03-07   1.04e8 1.12e6           3372.
## 5     64861        730 US             2023-03-08   1.04e8 1.12e6           3374.
## 6     46931        590 US             2023-03-09   1.04e8 1.12e6           3376.
## # ℹ 1 more variable: population <dbl>

Summarize global COVID-19 statistics at the country level

global_totals = global %>% 
  group_by(Country_Region, date) %>%
  summarize(cases=sum(cases), deaths=sum(deaths),
            recovered=sum(recovered),
            population=sum(Population)) %>%
  mutate(deaths_per_mill=deaths*1000000/population) %>%
  select(Country_Region, date, cases, deaths, recovered, deaths_per_mill, population) %>%
  ungroup()
## `summarise()` has grouped output by 'Country_Region'. You can override using
## the `.groups` argument.
global_totals =  global_totals %>% 
  mutate(new_cases = cases-lag(cases),
         new_deaths=deaths-lag(deaths),
         new_recovered=recovered-lag(recovered))

global_totals = global_totals %>%
  group_by(Country_Region) %>%
  summarize(deaths=max(deaths), 
            cases=max(cases),
            recovered=max(recovered),
            population=max(population),
            active=cases-recovered,
            cases_per_thou = 1000*cases/population,
            deaths_per_thou = 1000*deaths/population,
            recovered_per_thou = 1000*recovered/population) %>%
  filter(cases>0, population>0)

Identify countries with highest/lowest death and recovery rates. It also reflects the country with maximum active cases.

global_totals %>%
  slice_min(deaths_per_thou, n=10) %>%
  select(deaths_per_thou, cases_per_thou, everything())
## # A tibble: 10 × 9
##    deaths_per_thou cases_per_thou Country_Region deaths cases recovered
##              <dbl>          <dbl> <chr>           <dbl> <dbl>     <dbl>
##  1        0            35.8       Holy See            0    29        27
##  2        0           240.        Tuvalu              0  2828         0
##  3        0.000233      0.0000388 Korea, North        6     1         0
##  4        0.00320       4.51      Burundi            38 53631       773
##  5        0.0118        0.467     Chad              194  7679      4796
##  6        0.0123        1.64      South Sudan       138 18368     10514
##  7        0.0130        0.393     Niger             315  9508      5351
##  8        0.0131        1.86      Tajikistan        125 17786     14867
##  9        0.0134        2.31      Benin             163 27999      8136
## 10        0.0142        0.718     Tanzania          846 42906       183
## # ℹ 3 more variables: population <dbl>, active <dbl>, recovered_per_thou <dbl>
global_totals %>%
  slice_max(deaths_per_thou, n=10) %>%
  select(deaths_per_thou, cases_per_thou, everything())
## # A tibble: 10 × 9
##    deaths_per_thou cases_per_thou Country_Region         deaths  cases recovered
##              <dbl>          <dbl> <chr>                   <dbl>  <dbl>     <dbl>
##  1            6.66           136. Peru                   219539 4.49e6   2086086
##  2            5.50           187. Bulgaria                38228 1.30e6    398721
##  3            5.05           227. Hungary                 48762 2.20e6    749773
##  4            4.96           122. Bosnia and Herzegovina  16280 4.02e5    189710
##  5            4.64           166. North Macedonia          9662 3.47e5    150440
##  6            4.47           460. Montenegro               2808 2.89e5     99152
##  7            4.38           309. Croatia                 17987 1.27e6    354830
##  8            4.25           458. Georgia                 16971 1.83e6    390827
##  9            3.97           431. Czechia                 42491 4.62e6   1641321
## 10            3.87           491. Slovakia                21035 2.67e6    255300
## # ℹ 3 more variables: population <dbl>, active <dbl>, recovered_per_thou <dbl>
global_totals %>%
  slice_min(recovered_per_thou, n=10) %>%
  select(recovered_per_thou, cases_per_thou, everything())
## # A tibble: 10 × 9
##    recovered_per_thou cases_per_thou Country_Region deaths   cases recovered
##                 <dbl>          <dbl> <chr>           <dbl>   <dbl>     <dbl>
##  1            0           42.6       Kiribati           18    5014         0
##  2            0            0.0000388 Korea, North        6       1         0
##  3            0          484.        Nauru               1    5247         0
##  4            0          333.        Palau               9    5991         0
##  5            0          267.        Sweden          23777 2699339         0
##  6            0          159.        Tonga              13   16810         0
##  7            0          240.        Tuvalu              0    2828         0
##  8            0.00306      0.718     Tanzania          846   42906       183
##  9            0.00879    210.        Micronesia         61   23948         1
## 10            0.0103      41.0       Vanuatu            14   12014         3
## # ℹ 3 more variables: population <dbl>, active <dbl>, deaths_per_thou <dbl>
global_totals %>%
  slice_max(recovered_per_thou, n=10) %>%
  select(recovered_per_thou, cases_per_thou, everything())
## # A tibble: 10 × 9
##    recovered_per_thou cases_per_thou Country_Region deaths   cases recovered
##                 <dbl>          <dbl> <chr>           <dbl>   <dbl>     <dbl>
##  1               186.           620. Andorra           165   47890     14380
##  2               182.           515. Seychelles        172   50665     17874
##  3               158.           460. Montenegro       2808  288808     99152
##  4               157.           418. Bahrain          1553  710693    267220
##  5               153.           431. Czechia         42491 4618256   1641321
##  6               148.           696. San Marino        122   23616      5009
##  7               139.           344. Maldives          311  185738     75095
##  8               122.           641. Slovenia         7078 1331707    253972
##  9               116.           507. Luxembourg       1220  317367     72306
## 10               108.           298. Uruguay          7617 1034303    374203
## # ℹ 3 more variables: population <dbl>, active <dbl>, deaths_per_thou <dbl>
global_totals %>%
  slice_max(active, n=10) %>%
  select(active, cases_per_thou, everything())
## # A tibble: 10 × 9
##      active cases_per_thou Country_Region  deaths     cases recovered population
##       <dbl>          <dbl> <chr>            <dbl>     <dbl>     <dbl>      <dbl>
##  1 97504620          315.  US             1123836 103802702   6298082  329466283
##  2 39451607          585.  France          166176  39866718    415111   68128061
##  3 34589800          460.  Germany         168935  38249060   3659260   83155031
##  4 32467987          263.  Japan            72997  33320438    852451  126476458
##  5 30434803          597.  Korea, South     34093  30615522    180719   51269183
##  6 24642334          360.  United Kingdom  220721  24658705     16371   68403187
##  7 21458902          423.  Italy           188322  25603510   4144608   60461828
##  8 19309981          174.  Brazil          699276  37081209  17771228  212559409
##  9 16466176          151.  Russia          388478  22075858   5609682  145934460
## 10 13715990           32.4 India           530779  44690738  30974748 1380004385
## # ℹ 2 more variables: deaths_per_thou <dbl>, recovered_per_thou <dbl>

Data Visualization

COVID-19 Cases and Deaths in the US Over Time

The following plot shows cumulative COVID-19 cases and deaths over time in the United States.

us_totals %>%
  filter(cases > 0) %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = cases, color = "Cases")) +
  geom_point(aes(y = cases, color = "Cases")) +
  geom_line(aes(y = deaths, color = "Deaths")) +
  geom_point(aes(y = deaths, color = "Deaths")) +
  scale_y_log10() +
  labs(title = "Cumulative COVID-19 Cases and Deaths in the US",
       y = "Count (log scale)",
       color = "Metric") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 90, hjust = 1))

New Cases and Deaths in the US Over Time

This plot shows new COVID-19 cases and deaths per day in the US.

us_totals %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = new_cases, color = "New Cases")) +
  geom_point(aes(y = new_cases, color = "New Cases")) +
  geom_line(aes(y = new_deaths, color = "New Deaths")) +
  geom_point(aes(y = new_deaths, color = "New Deaths")) +
  scale_y_log10() +
  labs(title = "Daily New COVID-19 Cases and Deaths in the US",
       y = "Count (log scale)",
       color = "Metric") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 90, hjust = 1))

US States: Cases and Deaths Per Thousand - Aggregate cases and deaths per thousand by US state. Display states with highest and lowest death rates.

us_state_totals = us_by_state %>%
  group_by(Province_State) %>%
  summarize(deaths=max(deaths), cases=max(cases),
            population=max(population),
            cases_per_thou = 1000*cases/population,
            deaths_per_thou = 1000*deaths/population) %>%
  filter(cases>0, population>0)
  
us_state_totals %>%
  slice_min(deaths_per_thou, n=10) %>%
  select(deaths_per_thou, cases_per_thou, everything())
## # A tibble: 10 × 6
##    deaths_per_thou cases_per_thou Province_State        deaths  cases population
##              <dbl>          <dbl> <chr>                  <dbl>  <dbl>      <dbl>
##  1           0.611           150. American Samoa            34 8.32e3      55641
##  2           0.744           248. Northern Mariana Isl…     41 1.37e4      55144
##  3           1.21            231. Virgin Islands           130 2.48e4     107268
##  4           1.30            269. Hawaii                  1841 3.81e5    1415872
##  5           1.49            245. Vermont                  929 1.53e5     623989
##  6           1.55            293. Puerto Rico             5823 1.10e6    3754939
##  7           1.65            340. Utah                    5298 1.09e6    3205958
##  8           2.01            415. Alaska                  1486 3.08e5     740995
##  9           2.03            252. District of Columbia    1432 1.78e5     705749
## 10           2.06            253. Washington             15683 1.93e6    7614893
us_state_totals %>%
  slice_min(deaths_per_thou, n=10) %>%
  select(deaths_per_thou, cases_per_thou, everything())
## # A tibble: 10 × 6
##    deaths_per_thou cases_per_thou Province_State        deaths  cases population
##              <dbl>          <dbl> <chr>                  <dbl>  <dbl>      <dbl>
##  1           0.611           150. American Samoa            34 8.32e3      55641
##  2           0.744           248. Northern Mariana Isl…     41 1.37e4      55144
##  3           1.21            231. Virgin Islands           130 2.48e4     107268
##  4           1.30            269. Hawaii                  1841 3.81e5    1415872
##  5           1.49            245. Vermont                  929 1.53e5     623989
##  6           1.55            293. Puerto Rico             5823 1.10e6    3754939
##  7           1.65            340. Utah                    5298 1.09e6    3205958
##  8           2.01            415. Alaska                  1486 3.08e5     740995
##  9           2.03            252. District of Columbia    1432 1.78e5     705749
## 10           2.06            253. Washington             15683 1.93e6    7614893

Total COVID-19 Cases in US States

This map shows total reported cases by US state. California has reported the maximum cases in the country.

us_state_plot_data = us_state_totals %>%
  select(Province_State, cases) %>%
  rename(state = "Province_State")

plot_usmap(data = us_state_plot_data,
           values = "cases",
           color = "blue",
           labels = TRUE) +
  scale_fill_continuous(name = "Total Cases", label = scales::comma) +
  labs(title = "Total COVID-19 Cases by US State") +
  theme(panel.background = element_rect(color = "black", fill = "lightblue"))

Total COVID-19 Deaths in US States

This map shows total deaths reported in each US state. The below spatial analysis shows that most number of deaths have been reported in California followed by Texas. The least number of deaths are reported in American Samoa.

us_state_deaths_plot_data = us_state_totals %>%
  select(Province_State, deaths) %>%
  rename(state = "Province_State")

plot_usmap(data = us_state_deaths_plot_data,
           values = "deaths",
           color = "red",
           labels = TRUE) +
  scale_fill_continuous(name = "Total Deaths", label = scales::comma) +
  labs(title = "Total COVID-19 Deaths by US State") +
  theme(panel.background = element_rect(color = "black", fill = "lightblue"))

Global COVID-19 Cases by Country

The following map visualizes total COVID-19 cases per country.

global_plot_data = global_totals %>%
  select(Country_Region, cases) %>%
  rename(country = "Country_Region")

CASES <- global_plot_data %>%
  group_by(country) %>%
  summarise(`Total cases` = sum(cases)) %>%
  mutate(country = gsub("_", " ", country))

COVID.map <- joinCountryData2Map(CASES, joinCode = "NAME", nameJoinColumn = "country")
## 187 codes from your data successfully matched countries in the map
## 7 codes from your data failed to match with a country code in the map
## 56 codes from the map weren't represented in your data
par(mar = c(0, 0, 1, 0))
mapCountryData(COVID.map, nameColumnToPlot = "Total cases", colourPalette = 'topo')

Global COVID-19 Deaths by Country

The below graph shows the deaths caused by Covid 19 for all of the countries in the world.

global_deaths_plot_data = global_totals %>%
  select(Country_Region, deaths) %>%
  rename(country = "Country_Region")

CASES <- global_deaths_plot_data %>%
  group_by(country) %>%
  summarise(`Total deaths` = sum(deaths)) %>%
  mutate(country = gsub("_", " ", country))

COVID.map <- joinCountryData2Map(CASES, joinCode = "NAME", nameJoinColumn = "country")
## 187 codes from your data successfully matched countries in the map
## 7 codes from your data failed to match with a country code in the map
## 56 codes from the map weren't represented in your data
par(mar = c(0,0,1,0))
mapCountryData(COVID.map, nameColumnToPlot = "Total deaths", colourPalette = 'heat')

Global Recoveries

The below graph shows the recoveries from Covid 19 for all of the countries in the world.

global_recovered_plot_data = global_totals %>%
  select(Country_Region, recovered) %>%
  rename(country = "Country_Region")

CASES <- global_recovered_plot_data %>%
  group_by(country) %>%
  summarise(`Total recovered` = sum(recovered)) %>%
  mutate(country = gsub("_", " ", country))

COVID.map <- joinCountryData2Map(CASES, joinCode = "NAME", nameJoinColumn = "country")
## 187 codes from your data successfully matched countries in the map
## 7 codes from your data failed to match with a country code in the map
## 56 codes from the map weren't represented in your data
par(mar = c(0,0,1,0))
mapCountryData(COVID.map, nameColumnToPlot = "Total recovered", colourPalette = 'terrain')

Global Active Cases

The below graph shows the total active cases of Covid 19 for all of the countries in the world.

global_active_plot_data = global_totals %>%
  select(Country_Region, active) %>%
  rename(country = "Country_Region")

CASES <- global_active_plot_data %>%
  group_by(country) %>%
  summarise(`Total active` = sum(active)) %>%
  mutate(country = gsub("_", " ", country))

COVID.map <- joinCountryData2Map(CASES, joinCode = "NAME", nameJoinColumn = "country")
## 187 codes from your data successfully matched countries in the map
## 7 codes from your data failed to match with a country code in the map
## 56 codes from the map weren't represented in your data
par(mar = c(0,0,1,0))
mapCountryData(COVID.map, nameColumnToPlot = "Total active", colourPalette = 'rainbow')

Modelling

The COVID 19 data is modeled to predict the number of deaths with relation to all of the cases reported in us and world. We use linear regression, SVM, and neural network models to predict COVID-19 death rates based on cases and recoveries, comparing model performance.

Linear Regression

Predicting deaths per thousand using cases per thousand (US state-level data). Linear Regression tries to use one or more factors to identify linear relationship with the resultant value.

mod = lm(deaths_per_thou ~ cases_per_thou, data=us_state_totals)
summary(mod)
## 
## Call:
## lm(formula = deaths_per_thou ~ cases_per_thou, data = us_state_totals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3352 -0.5978  0.1491  0.6535  1.2086 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.36167    0.72480  -0.499     0.62    
## cases_per_thou  0.01133    0.00232   4.881 9.76e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8615 on 54 degrees of freedom
## Multiple R-squared:  0.3061, Adjusted R-squared:  0.2933 
## F-statistic: 23.82 on 1 and 54 DF,  p-value: 9.763e-06

The below graph shows the linear regression for us state data. It compares the actual data points and the ones predicted using the model.

us_total_w_pred = us_state_totals %>% 
  mutate(pred=predict(mod))

us_total_w_pred %>%
  ggplot() +
  geom_point(aes(x = cases_per_thou, y = deaths_per_thou), color = "blue") +
  geom_point(aes(x = cases_per_thou, y = pred), color = "red") +
  labs(
    title = "Linear Regression: Observed vs. Predicted Deaths per Thousand (US States)",
    x = "Cases per Thousand",
    y = "Deaths per Thousand"
  ) +
  theme_minimal()

The below Linear regression model predicts the death rate using the total cases for all the countries across the globe. The graph shows the predicted and actual values for global dataset.

mod = lm(deaths_per_thou ~ cases_per_thou, data=global_totals)
summary(mod)
## 
## Call:
## lm(formula = deaths_per_thou ~ cases_per_thou, data = global_totals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4069 -0.6090 -0.3898  0.4689  5.5423 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.6277935  0.1116605   5.622 6.57e-08 ***
## cases_per_thou 0.0035877  0.0004393   8.167 4.13e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.151 on 192 degrees of freedom
## Multiple R-squared:  0.2578, Adjusted R-squared:  0.254 
## F-statistic:  66.7 on 1 and 192 DF,  p-value: 4.127e-14
global_totals_w_pred = global_totals %>% 
  mutate(pred=predict(mod))

global_totals_w_pred %>%
  ggplot() +
  geom_point(aes(x = cases_per_thou, y = deaths_per_thou), color = "blue") +
  geom_point(aes(x = cases_per_thou, y = pred), color = "red") +
  labs(
    title = "Linear Regression: Observed vs. Predicted Deaths per Thousand (Global)",
    x = "Cases per Thousand",
    y = "Deaths per Thousand"
  ) +
  theme_minimal()

Multiple Linear Regression with Cases and Recovery and 3D Plot for Predicted and actual values

The below Linear regression model predicts the death rate using the total cases and recovery rate for all the countries accross the globe. It also shows the 3 dimensional graphs where the predicted and actual values are shown in 2 different graphs.

mod = lm(deaths_per_thou ~ cases_per_thou +  recovered_per_thou, data=global_totals)
summary(mod)
## 
## Call:
## lm(formula = deaths_per_thou ~ cases_per_thou + recovered_per_thou, 
##     data = global_totals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7061 -0.4499 -0.3471  0.4326  4.9949 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.4450801  0.1035210   4.299 2.73e-05 ***
## cases_per_thou     0.0018710  0.0004654   4.020 8.36e-05 ***
## recovered_per_thou 0.0152332  0.0022007   6.922 6.58e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.031 on 191 degrees of freedom
## Multiple R-squared:  0.4067, Adjusted R-squared:  0.4005 
## F-statistic: 65.46 on 2 and 191 DF,  p-value: < 2.2e-16
global_totals_w_pred = global_totals %>% 
  mutate(pred=predict(mod))

global_totals_w_pred %>%
  ggplot() +
  geom_point(aes(x = cases_per_thou, y = deaths_per_thou), color = "blue") +
  geom_point(aes(x = cases_per_thou, y = pred), color = "red") +
  labs(
    title = "Multiple Linear Regression: Deaths per Thousand vs. Cases and Recovery",
    x = "Cases per Thousand",
    y = "Deaths per Thousand"
  ) +
  theme_minimal()

plot_ly(data = global_totals_w_pred, x = ~cases_per_thou, y = ~recovered_per_thou, z = ~pred,
        type = 'scatter3d', mode = 'markers',
        marker = list(size = 5, color = ~pred, colorscale = 'Viridis')) %>%
  layout(scene = list(xaxis = list(title = 'Cases per Thousand'),
                      yaxis = list(title = 'Recovered Per Thousand'),
                      zaxis = list(title = 'Predicted Deaths per Thousand')))
plot_ly(data = global_totals_w_pred, x = ~cases_per_thou, y = ~recovered_per_thou, z = ~deaths_per_thou,
        type = 'scatter3d', mode = 'markers',
        marker = list(size = 5, color = ~deaths_per_thou, colorscale = 'Viridis')) %>%
  layout(scene = list(xaxis = list(title = 'Cases per Thousand'),
                      yaxis = list(title = 'Recovered Per Thousand'),
                      zaxis = list(title = 'Deaths per Thousand')))

Model Performance Metrics: The below graph shows the performance of the model. It reflects the actual and predicted values with a Root Mean Squared Value of 1.14

rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}
actual_values <- global_totals_w_pred$deaths_per_thou
predicted_values <- global_totals_w_pred$pred

print(paste("RMSE:", rmse(actual_values, predicted_values)))
## [1] "RMSE: 1.02340386994039"
print(paste("R2:", caret::R2(predicted_values, actual_values)))
## [1] "R2: 0.406675841326311"
plot(actual_values, type = "l", col = "blue", main = "Actual vs Predicted Values", xlab = "Index", ylab = "Deaths per Thousand")
lines(predicted_values, col = "red")
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "red"), lty = 1)

SVM

The below code uses SVM model using the total cases and recoveries for covid 19 for global data. The performance of this model is not good as compared to Linear Regression because of high RMSE.

regressor <- svm(deaths_per_thou ~ cases_per_thou + recovered_per_thou,
                 data = global_totals,
                 kernel = "linear", cost = 10, scale = FALSE)
predictions <- predict(regressor, newdata = global_totals)

print(paste("SVM RMSE:", rmse(global_totals$deaths_per_thou, predictions)))
## [1] "SVM RMSE: 1.31319395596001"
plot(global_totals$deaths_per_thou, type = "l", col = "blue", main = "SVM: Actual vs Predicted", xlab = "Index", ylab = "Deaths per Thousand")
lines(predictions, col = "red")
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "red"), lty = 1)

Neural Network

The below code uses neural network model using the total cases and recoveries for covid 19 for global data. The performance of this model is best as it has the lowest Root Mean Squared value among the models tested as part of this report.

nn = nnet(deaths_per_thou ~ cases_per_thou + recovered_per_thou, data = global_totals, size = 10, maxit = 1000, linout = TRUE)
## # weights:  41
## initial  value 353.561299 
## iter  10 value 254.469143
## iter  20 value 190.203043
## iter  30 value 159.484947
## iter  40 value 149.918189
## iter  50 value 145.741366
## iter  60 value 145.067310
## iter  70 value 144.214227
## iter  80 value 141.954202
## iter  90 value 131.189503
## iter 100 value 125.918014
## iter 110 value 120.917149
## iter 120 value 117.646201
## iter 130 value 111.305433
## iter 140 value 107.328426
## iter 150 value 101.221107
## iter 160 value 99.648573
## iter 170 value 98.911198
## iter 180 value 98.424001
## iter 190 value 97.529997
## iter 200 value 97.340605
## iter 210 value 97.331423
## iter 220 value 97.327366
## iter 230 value 97.318121
## iter 240 value 97.317079
## iter 250 value 97.316419
## final  value 97.316411 
## converged
predictions <- predict(nn, newdata = global_totals)

print(paste("Neural Network RMSE:", rmse(global_totals$deaths_per_thou, predictions)))
## [1] "Neural Network RMSE: 0.708259123738405"
plot(global_totals$deaths_per_thou, type = "l", col = "blue", main = "Neural Network: Actual vs Predicted", xlab = "Index", ylab = "Deaths per Thousand")
lines(predictions, col = "red")
legend("topright", legend = c("Actual", "Predicted"), col = c("blue", "red"), lty = 1)

Bias

While the data from Johns Hopkins University aggregates multiple sources, some cases may go unreported due to limitations in data availability and cross-country reporting standards. Data coverage is more comprehensive for the US compared to other countries, which could introduce bias.

On the analysis side, variable selection (cases, deaths, recovered) and certain model choices reflect analyst bias in focusing on specific aspects of the pandemic. Visualization and mapping may omit regions due to missing identifiers.

Conclusion

This report provides an overview and modeling of COVID-19 incidence and mortality trends both in the US and globally. Countries and US states with the highest and lowest case and death rates were identified. Neural Network models offered the best predictive accuracy for deaths per thousand among models tested. The analysis highlights patterns that can assist with public health planning and response, but should always be interpreted with awareness of data and modeling limitations.